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Abstract —We study optimal estimation for sparse principal 
component analysis when the number of non-zero elements is 
small but on the same order as the dimension of the data. We 
employ approximate message passing (AMP) algorithm and its 
state evolution to analyze what is the information theoretically 
minimal mean-squared error and the one achieved by AMP in 
the limit of large sizes. For a special case of rank one and 
large enough density of non-zeros Deshpande and Montanari (ij 
proved that AMP is asymptotically optimal. We show that both 
for low density and for large rank the problem undergoes a series 
of phase transitions suggesting existence of a region of parameters 
where estimation is information theoretically possible, but AMP 
(and presumably every other polynomial algorithm) fails. The 
analysis of the large rank limit is particnlarly instructive. 

I. Introduction 

Suppose we are given a data matrix Y G that was 

obtained from the following model 

Y = ^X'^X + W, (1) 

VJV 

where 26 is a matrix in Each of the X elements of X is 

an independent random variable in K’’ distributed according to 
Po(x). Further W G R^^^ is a symmetric noise matrix with 
elements distributed as A/'(0, A). The observation thus 
consists of a rank r matrix corrupted by a Gaussian noise. 
The main difficulty, but also interest, stems from the fact that 
we require X to be sparse: only a fraction p of elements of 
X are non-zero and Po{x) is constrained accordingly. 

Let us denote by Xq the true underlying signal matrix. We 
treat the problem of estimating X from Y. We evaluate and 
analyze the estimator X that minimizes the mean squared error 

MSE= 2e(||1-Xo|||) . (2) 

If the distribution Po{x) is known then such an estimator is 
given by the mean of the marginals of the posterior probability 
distribution P{X\Y). We analyze optimal estimation in the 
above model in the limit where the system size N is large, 
but rank is small r = 0(1), noise variance A = 0(1), and 
fraction of non-zero elements p = 0(1). 

We aim to answer the following two questions: (Ql) What 
is the information theoretically minimal mean-squared error 
(MMSE) in this limit? (Q2) In what range of parameters can 
the MMSE be achieved with a computationally tractable algo¬ 
rithm? We answer these question by studying the approximate 
message passing (AMP) algorithm to estimate the marginals 
of the posterior likelihood, and its asymptotic state evolution 


(SE). Our results rely on a conjecture from statistical physics, 
that the present problem belongs to a class of problems 
for which the fixed points of the state evolution describe 
asymptotically exactly both the optimal estimator and the 
performance of AMP. Moreover current experience from other 
such problems suggests that when AMP does not reach the 
MMSE then no other polynomial algorithm will. 

A. Motivation and background 

Principal component analysis (PCA) is a common dimen¬ 
sionality reduction technique that aims at describing the data as 
linear combination of a small number of principal components. 
In sparse PCA Q we search for principal components with 
many zero elements to facilitate the interpretation of the 
result. Only the non-zero elements then correspond to features 
relevant for describing the variability in the data. The sparsity 
constraint makes the problem algorithmically challenging. Let 
us note that when talking about sparse PCA we have more 
often in mind a model of a type Y = UV'^ + W, rather 
than ([T]|. For simplicity of presentation, in this short report, we 
restrict to the model Q, but applying our method to the UV'^ 
setting is straightforward and leads to comparable results. 

Abundance of applications of sparse PCA motivated both al¬ 
gorithmic development and theoretical studies of the problem, 
see e.g. 0-0. Many of these existing works are concerned 
with exact recovery of the support of non-zero elements. 
However, exact recovery of the support is possible only when 
the number of non-zeros is subextensive, i.e. the fraction of 
non-zeros p = o(l) 0. In our setting, we assume p = 0(1) 
which is reasonable in many applications. In this paper we 
model a typical case of sparse PCA by Q and analyze this 
model in Bayesian probabilistic setting where we assume the 
knowledge of the distribution Pq{x). 

For sparse PCA of rank one r = 1 the AMP algorithm and 
its state evolution have been derived by Rangan and Fletcher 
0. Deshpande and Montanari 0 were able to prove that for 
Bernoulli distributed coordinates (and always rank one) the 
state evolution equation indeed describes exactly the evolution 
of the algorithm at large sizes when the density of non-zeros 
p is large enough. Remarkably, they also proved that the 
asymptotic MSE achieved by AMP is in this case information 
theoretically optimal. Additionally, in the regime where their 
proof is valid they did not observe any phase transition in the 
MMSE. The corresponding AMP algorithm for generic rank 
was derived in ifTOl, however, without the state evolution. 


The result of Deshpande and Montanari about asymptotic 
optimality of AMP is surprising for at least two reasons: First, 
for the question of support recovery, there is a well known 
large gap between what is information theoretically possible 
and what is tractable with current algorithms 0- Moreover 
this gap was linked to the problem of planted clique Ug, 
where it is believed that no polynomial algorithm will be 
able to achieve the information theoretic performance. Second, 
in the regime where the rank is small but scales linearly 
with N an analogous gap between information theoretic and 
tractable algorithmic performance was predicted to exist |jTg. 
This motivates us to revisit the analysis of the model^m 
in particular in the region of small density p and for rank 
larger than one. Indeed, in both these cases we identify phase 
transitions in the MMSE, as well as regions where AMP is 
suboptimal. Technique-wise the contribution of the present 
paper is a generalization of the state evolution to arbitrary 
rank r, and analysis of the AMP-MSE and the MMSE for a 
range of distributions Po{x). 

II. Erom amp to state evolution 


A. AMP 

Here we remind the approximate message passing algorithm 
for general rank r fTO) , and sketch the derivation of the 
corresponding state evolution. Eor rank one our results reduce 
to the state evolution of 0, 0- 

AMP is a large-iV simplification of the belief propagation 
equations (BP) | fT3| for a graphical model corresponding to 
the posterior probability 

In the derivation we distinguish between P{x) ^ Po{x), but 
later we assume equality of the two. Variables in this graphical 
model are the r-component vectors for p = 

Belief propagation is written as an iterative procedure on 
message that are probability distributions over x^. 

Remarkably the large-iV expansion of the corresponding BP 
equations closes on messages G K’’ that are means of the 

BP messages. The AMP algorithm is a further simplification 
of the corresponding equations where the fact that messages 
depend only weakly on one of the index is exploited 
leading to a so called Onsager correction term. This leads to 
the following AMP iterative algorithm that is easily amenable 
to implementation 

a‘+^ = f(A\Bp, = (4) 

where G M’" G the arguments are from and 

respectively, and are given by 

= ( 5 ) 

- AV (E. ■ (6) 


The function f(A, B) G M'" is defined as the mean of the 
normalized probability distribution 

M{x,A,B) = j^^^^P{x)e-^^^"^^+^"^. (7) 

The AMP equations are usually initialized in such a way that 
is the mean of the prior distribution P{x) and its 

variance. 

Einally the AMP approach also provides an approximation 
for the log-likelihood (f) = log Z(Y), where Z(Y) is the 
normalization of 0. This is related to the Bethe free energy 
m simplified along the very same lines as BP was simplified 
into AMP. Given a fixed point of the AMP algorithm we 
compute the Bethe log-likelihood as 

'i^’= (8) 

where M{A, B) is the normalization from 0 and 

^ y y 

( 9 ) 


B. State evolution 

State evolution describes the behavior of the AMP algorithm 
along iterations via two order parameters from 




The mean squared error is related to these parameters as 
MSE = Tr[Ep„{a;oX[[) - 2M -f Q] . 


( 10 ) 

( 11 ) 

( 12 ) 


With this definition we have from 0 A* = Q*/A, and from 
0 by using 0 to express and neglecting sub-leading 
order terms, we derive that B* is a random Gaussian variable 
with mean {xq)^M^/A and variance Q*/A. The above order 
parameters hence follow the state evolution equations 


[f{^, ^xo + W)r {.,.) 
fd^^xo + W)x^ 


= E 


PoAo),Pw(W) 


,( 13 ) 

(14) 


where FF is a r-variate Gaussian random variable with zero 
mean and of covariance Q*/A, the arguments of / and 
in ([0 are the same. The Bethe log-likelihood can then be 
evaluated from the fixed point of the state evolution as 


= E 


PoAo),Pw{W) 


logA/'( 


Q Mxq 
A’ A 


AV) 




( 15 ) 


We recall that P{x) from 0 appears in ( T3][T4 i via the 
definition of the function f{A,B) in 0. 

In this paper we work in the so-called Bayes-optimal setting 
where we assume Po{x) = P{x), the state evolution then 
simplifies because M* = Q* for all t. This is called the 
Nishimori condition in statistical physics in and was also 
















derived in ||Tj. The intuition behind this condition is that in 
Bayes-optimal inference the ground true signal Xq behaves in 
exactly the same way as a random sample from the posterior 
distribution and hence Q that describes the overlap between 
two randomly chosen samples is the same as M that describes 
the overlap beween a randomly chosen sample and Xq. 

C. Statistical physics conjecture 

The belief propagation equations from which we derived the 
AMP and the state evolution assume that certain correlations 
between incoming messages are weak enough. In statistical 
physics this assumption is widely accepted to hold for in¬ 
ference in the Bayes optimal setting on models such as 
that correspond to a fully connected factor graph with weak 
interactions on factor nodes. This has been used in many 
works, see e.g. a more detailed discussion in and it has 
been proven in a subset of cases, see notably the closely related 
Q or |T4). 

Under the above assumption and in the limit of large 
N, the MMSE can be computed from a fixed point of the 
state evolution equations ( [T3][T^ that has the largest log- 
likelihood And the AMP-MSE can be computed from 
a stable fixed point of the state evolution that is reached 
iteratively from initialization = e for a very small e. 


111. Analysis of the state evolution 


A. Phase of undetectability 

A first observation we make about the general state evo¬ 
lution ( T3|T4 i is that M = Q = Q m fixed point if and 
only if the prior distribution P{x) in (|^ has a zero mean. 
This trivial fixed point corresponds to as large MSE ( [T^ as 
if all we knew about the signal X was the distribution Po{x). 
In case Q = M = 0 is the fixed point with maximum log- 
likelihood then the matrix Y did not contain any sign of the 
low rank perturbation, the information was completely lost in 
the noise, and we denote the signal X as undetectable. 

Whenever the mean of P{x) is nonzero and Po{x) = P{x) 
then for large but finite A the state evolution has a fixed 
point with MSE smaller than Epg(a:a;^). This means that 
for distributions with non-zero mean the observed matrix Y 
always contains additional (to the prior) information about the 
signal, in that case we say that the signal in detectable. In 
this sense the sparse PCA problem is harder for distributions 
having zero mean. 

We now study the linear stability of the fixed point Q = 0, 
M = 0 in the case where both P{x) and Po{x) have 
zero mean. We expand the state evolution equations around 
the trivial fixed point up to the first order in Q and M. 
Looking at the Taylor expansion of f{Q/A, Mxq/ A + W), 
it is only the term kU9p/(0, 0) that will matter for eq. ([B]., 
and Ma;o9_B/(0, 0)/A that will matter for eq. (14 1 . Realizing 
that from definition (j^ 5p/(0, 0) is the covariance of the 
distribution P{x) we get 


g‘+i = -i]Q*s, 


=-SM‘Eo, (16) 


where E and Eq are respectively the covariance matrices of 
P{x) and Po{x). 

When E = Eq the above linearization will converge away 
from the trivial fixed point for A < A^, with 

A„ = max{A^, A G Spectrum(E)} , (17) 

and the linearization is a contraction for A > A^. For 
distributions of X of zero mean, there is hence a phase 
transition in the behavior of AMP-MSE at Au- Translated into 
the behavior of the iterative AMP algorithm, when A > A^ 
AMP will converge to a trivial fixed point — 0 for all /i, 
and to a fixed point of smaller MSE for A < A„. It is quite 


remarkable to notice that this stability criteria (17 1 of the trivial 
fixed point is universal, in the sense that it does not depend 
on the details of the distributions Po{x) and P{x), it only 
requires their means to be zero and their covariances to agree. 

The phase transition at A„ (and its universality) remarkably 
reminds us of a detectability/undetectability spectral phase 
transition known for the canonical PCA fTS) , fib) . The dif¬ 
ference is that in our setting there is no such phase transition 
when P{x) has a non-zero mean. 

B. The Gauss-Bernoulli case 

A particularly interesting and in our opinion representative 
example of distribution P{x) that we will (among others) 
study in this paper is the r-variate Gauss-Bernoulli 

,.2 ' 


P{x) = Pq{x) = (1 - p)5[x) + 


p 




exp 


— X 


(18) 


here a; G K’’ and 5{x) is a r-dimensional Dirac delta function. 

Using the criteria (17 1 we conclude that for A > amp 
( randomly initialized) will not be able to detect that matrix Y 
was a noisy low-rank matrix. On the other hand for A < 
AMP will converge to a fixed point giving informative MSE. 

For rotationally invariant distributions such as (18i, i.e. 


P(x) = P{Rx) where R any orthogonal matrix, we argue 
that the rotational symmetry is also preserved in the state 
evolution. The covariance of rotationally invariant distributions 
must be proportional to an identity Ep(a;x^) = ctoI. with 
tJo G R- At the same time the order parameter Q* plays the 
role of an estimator of this covariance and hence we assume 
that for all t we have Q* = q\ with q* G K. Note that for 
rotationally invariant P{x) the signal X can be estimated only 
up to a rotation R. In what follows we will always assume 
minima over all possible R, which is equivalent to assuming 
M* = mH. The state evolution for ( [T^ and generic rank r 
can then be written explicitly as 

I g* , {q^r 

A’ A 




pq 




q" 


A4 
+ 00 

Jr{a,T) = / duPr{u) 


1-f 


A2 


■[l-/3(c 


(19) 


^)] 


"(1 + a) 


p[c 


where we used the Nishimori condition q* = to*, and 


Pr{u) = 


1 


( 271 ) 


exp - 


SrU 


r—1 


( 20 ) 
















Sr being the surface of a unit sphere in r dimensions. And 
where p{a, b) is an estimator of the probability that a given 
vector component is nonzero. 

P 


p{a,b ) = 


( 21 ) 


(l-p)exp(-2(^) (l + a)5 +p 

C. The limit of large rank r 

In this section we show that for the Gauss-Bernoulli sig¬ 
nal ( 18 I when the rank r is sufficiently large the state evolution 
has a fixed point at qr = p — A + 0 ^( 1 ) whenever A < p, and 
this fixed point has always larger log-likelihood than the trivial 
fixed point at g = 0. Moreover, we derive that the probability 
that a given component of the support is correctly discovered 
goes to one exponentially fast in r when r grows. This means, 
on the one hand, that at large r the MMSE of the sparse PCA 
behaves exactly as if the support of the non-zero elements was 
known, i.e. MSE = p for A > p and MSE = A for A < p as 
in 116|. AMP, on the other hand, will reach the MMSE only 
for A < p^ and will give MSE = p otherwise. There is hence 
a wide hard region in sparse PCA between p^ < A < p when 
r —>■ (X). 

To prove the above large-r results we analyze the function 
p{a,Tu^). Prom the definition (21 1 we can rewrite 

P 


p(c 


with 


') = 




'K{a 


P 


K{a,T) = —— -blog(l-f a). 
1 -I- a 


( 22 ) 


(23) 


Prom (20 1 we get that vf — r has zero mean and variance 2r. 
The exponential in ( |22) l is hence at large r dominated by the 
term K{a,yPr). Prom concavity of log we see that for all 
g 7 ^ 0 we have K <0. Therefore one can 

substitute p{a^Tvf) = 1 - 1 - where c S into the 

state evolution. Finally using ( [T9] l we get for r —> -|-c» and 
g 7 ^ 0 an iteration g*+^ = pg‘/(A -b g‘). When A < p this 
equations has a stable fixed point given by g = p — A. For 
A > p the only stable fixed point is g = 0. 

For the distribution ( |T8] l the log-likelihood ( [fS] ) becomes 

</> = Ep^(„){p^[|, (g + A)^] + (l-p)^(|, 


A’ A 4A ’ 
rue)] with e being an arbi- 


A’^^ ^A2- 

where fi{a,Tvf) = log[A/'(aI, 

trary vector of unit norm. 

For the fixed point g = 0 we get (?!) = 0. To compare to 
the log-likelihood of the fixed point g = p — A we develop 
'0{a, TV?) for large r and assume g 7 ^ 0. For K{a, r) < 0 we 
have '(p{a,Tvf) « log(p) — '^K{a,T), from which we obtain 


= -f 


log( 


A 


1 


1 + ^ ^ 


A 2pA 


a(l), (24) 


which evaluated at g = p — A is positive (and hence larger 
that the value for the fixed point g = 0) if and only if A < p. 
This tells us that at large enough r and A < p the fixed point 
g = p — A corresponds to the MMSE. 
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Fig. 1. Example of the discontinuous phase transitions in the MSE for 
sparse PCA. The data are for the Gauss-Bernoulli distribution jl8) , rank 
one and density p = 0.1. The lines are results of the state evolution, the 
points of the AMP algorithm run on one random instance of the problem with 
N = 20000. In blue (left most) is the MSE reached from an uninformative 
initialization of the SE/AMP. In green (right most) is the MSE reached from 
the informative initialization. In red (middle) is the MMSE. The discontinuities 
are at Aamp = 0.0100(1), Ac = 0.0153(1), Asnd = 0.0161(1). 
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Fig. 2. The phase diagram in the density p versus noise A plane for the 
Gauss-Bernoulli signal jl8) , rank r = 50. The curves from above are: In blue 
is the pAMP above which AMP is asymptotically able to find the MMSE. 
In red is the phase transition in the MMSE pc- In green is the 2nd spinodal 
P2nd below which the informative fixed point does not exist anymore. For 
A > 0.32(1) the transition is continuous and the three critical values are 
equal to pu = "s/A- 


D. Numerical Results 

In this section we give several examples of a numerical 
investigation of the fixed points of the AMP algorithm and 
of the state evolution. In all these experiments we iterate 
the corresponding equations till convergence and monitor the 
corresponding mean squared error and the value of the log- 
likelihood. We initialize the iterations in two different ways. 

• Uninformative initialization; In state evolution this means 

= e, where e is very small. In the AMP algorithm 
this corresponds to and = EpQ(a:oxJ). 

• Informative initialization; In the state evolution this 

means = Epq(xoxJ), in AMP this means = 

{xo)f^ and v^=° = 0 . 

In a region where these two initializations converge to a 
different fixed point the MMSE is the one for which the log- 
likelihood evaluated from is larger. In Fig. we compare 
the MSE reached by the state evolution and the AMP algorithm 
on an instance of size N = 20000 and as expected we see an 
excellent agreement. 

























Fig. 3. The phase diagram in the density p versus noise A plane for the 
rank-one spiked Wigner model investigated in fn, i.e. Pq{x) = p5{x — 1) + 
(1 — p)5{x). This signal distribution does not nave a zero mean and hence 
the trivial fixed point of the state evolution does not exist. This translates 
into the fact that for p > 0.041(1) the AMP is information theoretically 
optimal and there is no phase transition. For low densities we, however, again 
observe the three phase transitions, from above: In blue is the AMP spinodal 
pAMP above which AMP is asymptotically able to find the MMSE. In red 
is the discontinuous phase transition in the MMSE pc. Therefore, AMP is 
suboptimal in the region between the blue and red curve, for pc < p < pAMP- 
In green is the 2nd spinodal p 2 nd below which the informative fixed point 
does not exist anymore. 

Figs. and 1^ are for the Gauss-Bernoulli signal distribution 
for which the trivial fixed point exists and is locally stable for 
A > In this case of zero mean Po{x) there is either a 
single second order (continuous) phase transition at A„ = 
or a (discontinuous) first order phase transition in the MMSE 
at Ac > with its two spinodals, Aamp and A 2 nd- By the 
results of the previous section in the limit of large rank we 
have Ac(r —>■ oo) = A 2 nd(f —> oo) = P- Fig- [^depicts the 
result for rank r = 50. 

In Fig.we depict the result for a case of Po{x) with a non¬ 
zero mean. In that case either there is a unique fixed point and 
no phase transition, or there are two fixed points, both with 
MSE smaller than if the signal was chosen randomly only 
according to the prior information. Remarkably, the first order 
phase transition seems always to happen for small densities. 
Fig. [^depicts the case of spiked Wigner model from Q, in 
a region of densities for which the proof of Q did not apply. 
Note that the AMP spinodal pamp and the curve of phase 
transition in MMSE pc arrive with a different slope to p —> 0, 
this is consistent with the algorithmic gap for support recovery 
for sub-extensive support size ®- 

IV. Conclusions 

In this paper we analyzed probabilistic estimation in sparse 
PCA modeled by Q- We derived the state evolution of the 
approximate message passing algorithm for general rank r. 
Relying on a statistical physics conjecture about exactness of 
this state evolution we analyze it’s fixed points to compute the 
minimal mean squared error and the error made by AMP in the 
large N limit. Generalization of the proof technique of |[T] to 
small densities, ranks larger than one and signal distributions 
having zero mean are an important topic for future work. 

For signal distributions of zero mean we unveil an un¬ 
detectability regime where no algorithm can do better in 
estimation of the signal than random guessing. We observe a 


first order (discontinuous) phase transition in regions of small 
density p or large rank r. Existence of such a first order phase 
transition is related to the existence of a region of parameters 
where AMP is asymptotically sub-optimal. 

In the large rank limit the emerging picture is particularly 
simple (at least for signal distributions of zero mean). The 
asymptotic MMSE is equal to the MMSE of the problem with 
known support, and the probability of false negatives in the 
support recovery is exponentially small when r ^ oo. The 
MMSE is not reachable for AMP unless the noise variance is 
smaller than the variance of the signal squared. We expect that 
this hard region will stay computationally hard also for other 
polynomial algorithm. Proving such a result about algorithmic 
banier for a generic class of algorithms is a very interesting 
challenge. 
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